CMSC330 Final Project

Jetson Ku


For this project I am using a dataset from Kaggle.com user Hugo Mathien titled "European Soccer Database". The format is an SQLite database with a few different tables. The ones that we are primarily interested in here are: Matches - historical data for individual matches from 2008-2016, including betting odds, and Player_Attributes - player ratings from Electronic Arts' FIFA soccer video game.

Oddsmakers for sportsbooks (sports gambling companies) have a very difficult task in that they must outpredict the market. The odds are tipped slightly in their favor, as we will find out, but the margins are tight enough that there's a chance they can be beaten. In this notebook, we are going to explore a few avenues that machine learning might afford us, and gain some insight into how these odds might be generated in the first place.

Let's import the data and take the tables that we need from the schema. Since it's an SQL database, we need to use queries to pull them.
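The load itself is just a couple of `pd.read_sql` calls against the SQLite file. Here is a minimal sketch with an in-memory database standing in for the Kaggle download; the table and odds column names follow the schema described above (e.g. `B365H`/`B365D`/`B365A` for one bookmaker's home/draw/away odds), but check them against your copy.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for the Kaggle SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Match (id INTEGER, home_team_goal INTEGER, "
    "away_team_goal INTEGER, B365H REAL, B365D REAL, B365A REAL)"
)
conn.executemany(
    "INSERT INTO Match VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 2, 1, 1.5, 4.0, 6.0), (2, 0, 0, 2.2, 3.3, 3.1)],
)

# Pull a table into a DataFrame with a plain SQL query.
matches = pd.read_sql("SELECT * FROM Match", conn)
```

With the real file, the same pattern works with `sqlite3.connect("path/to/database.sqlite")` and one query per table.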

Let's take a look at what we're working with.

115 columns is a little overwhelming. For each of the gambling companies in the dataframe, let's see how they historically favor teams. We're going to look at most of this analysis from the perspective of the home team.
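The per-bookmaker summary boils down to two numbers per company: how often its favorite (the shortest odds) actually won, and its average margin. A sketch on made-up rows, using the three-column-per-bookmaker naming from the dataset:

```python
import pandas as pd

# Invented example rows; "result" is the actual full-time outcome.
m = pd.DataFrame({
    "B365H": [1.5, 2.8], "B365D": [4.0, 3.1], "B365A": [6.0, 2.5],
    "result": ["H", "A"],
})

for book in ["B365"]:                          # one entry per bookmaker
    cols = [book + s for s in "HDA"]
    implied = 1 / m[cols]                      # decimal odds -> implied prob
    margin = implied.sum(axis=1) - 1           # overround per match
    fav = m[cols].idxmin(axis=1).str[-1]       # shortest odds = favorite
    acc = (fav == m["result"]).mean()          # how often the favorite wins
    print(book, round(margin.mean(), 3), round(acc, 3))
```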

Now that we've calculated all those statistics, let's interpret them.

So most companies are only right ~53% of the time! You may be thinking, "I can guess more than 53% of soccer winners right! Pshh!" The second number, the margin, is where they get you. To convert from decimal odds, which we have here, to implied probability, we divide 1 by the odds.

For example, suppose you place 5 dollars on Manchester City to beat Liverpool at 1.5 odds. If City wins, you walk away with $7.50 in your pocket (including the original $5); if City loses, you lose the $5 altogether. The implied probability in this case is 1/1.5 ≈ 0.667, which leaves only 0.333 for both of the other outcomes: a Liverpool win and a draw. What these companies do is set Liverpool's odds such that the implied probability is, say, 0.23 and a draw's is 0.12. Adding these all up gives 0.667 + 0.23 + 0.12 ≈ 1.017 - more than 1 - and that excess is the margin. Companies offer lower odds, and therefore lower payouts, by implying that each outcome is more likely than what would be considered fair.
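The worked example above fits in a few lines. The odds here are the hypothetical City/Liverpool numbers from the text, not real quotes:

```python
def implied_prob(odds):
    """Convert decimal odds to the bookmaker's implied probability."""
    return 1.0 / odds

# City at 1.5, with Liverpool and the draw priced so their implied
# probabilities are 0.23 and 0.12 (the made-up numbers from the text).
probs = [implied_prob(1.5), 0.23, 0.12]
overround = sum(probs)      # the "book" sums to more than 1
margin = overround - 1.0    # the bookmaker's edge on this match
```

A fair book would sum to exactly 1; everything above that is the house's cut.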

Let's check the calibration of the industry favorites' implied probabilities against actual winners.

We make a scatterplot of predicted win probability vs observed win percentage.
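In code, the comparison amounts to binning the implied probabilities and computing the observed home-win rate within each bin. A sketch on synthetic odds that are perfectly calibrated by construction (the real inputs come from the odds columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: implied home-win probabilities and outcomes drawn
# so the odds are perfectly calibrated.
rng = np.random.default_rng(0)
implied = rng.uniform(0.1, 0.9, 5000)
home_win = rng.random(5000) < implied

df = pd.DataFrame({"implied": implied, "home_win": home_win})
df["bin"] = pd.cut(df["implied"], bins=np.linspace(0, 1, 11))
calibration = df.groupby("bin", observed=True)["home_win"].mean()
# Plotting bin midpoints against these rates should hug the y = x
# diagonal; real odds deviate from it.
```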

Those discrepancies between the red and blue lines are what we look to exploit: the so-called "value".

What does the distribution of actual outcomes look like compared to the distribution of favorites?
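The comparison is two `value_counts` calls: one over the bookmaker's favorite per match, one over the actual results. A sketch on invented rows, again using the three-column odds naming:

```python
import pandas as pd

# Invented example matches; "result" is the actual full-time outcome.
m = pd.DataFrame({
    "B365H": [1.5, 2.8, 2.1, 1.7],
    "B365D": [4.0, 3.1, 3.3, 3.8],
    "B365A": [6.0, 2.5, 3.4, 4.6],
    "result": ["H", "D", "D", "H"],
})

# Shortest odds = favorite; keep just the H/D/A suffix.
m["favourite"] = m[["B365H", "B365D", "B365A"]].idxmin(axis=1).str[-1]
print(m["favourite"].value_counts())   # draws are almost never favored
print(m["result"].value_counts())      # ...but they happen regularly
```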

So the shortest odds on a match are rarely on the draw, yet draws occur far more often than they are made the favorite.

Here's where things can get interesting. What if we can use the attributes of the real-life players in a virtual game to predict the outcome of real-life matches? That's the idea of the rest of this analysis. My hypothesis is that as the difference in average rating goes up, a team is more likely to win.

First we need to join the dataframes of player attributes and matches. For each row in matches (1 match), there are 22 columns for the 22 players that start the match. We want to find the average rating of the players listed for each team. A few merges would do the trick, but the problem is that for each player in the player attributes table, there are multiple iterations of ratings that are updated over time. It would make sense to take the most recent rating for a player before a match, so that is what we do here.
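The "most recent rating before the match" lookup is exactly what `pandas.merge_asof` does: for each match date, it takes the latest rating row at or before it, per player. A sketch on invented rows (both frames must be sorted on their time keys; column names mirror the dataset):

```python
import pandas as pd

# Invented rating history: player 7 is re-rated over time.
ratings = pd.DataFrame({
    "player_api_id": [7, 7, 9],
    "date": pd.to_datetime(["2010-01-01", "2012-01-01", "2011-06-01"]),
    "overall_rating": [78, 84, 70],
}).sort_values("date")

# One row per (match, starting player) after unpivoting the 22 columns.
lineups = pd.DataFrame({
    "match_date": pd.to_datetime(["2011-05-01", "2013-05-01"]),
    "player_api_id": [7, 9],
}).sort_values("match_date")

# direction="backward": latest rating dated at or before the match.
merged = pd.merge_asof(
    lineups, ratings,
    left_on="match_date", right_on="date",
    by="player_api_id", direction="backward",
)
```

From here, a groupby over match and team averages the 11 ratings per side.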

Now that we have that data together, let's investigate how average rating affects the home team's win percentage.
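The investigation itself is a binned win rate: group matches by the home side's average rating and take the mean of the home-win indicator in each bin. A sketch on synthetic data with a built-in rating effect (the real columns come from the merge above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: higher-rated home sides win more often, with noise.
rng = np.random.default_rng(2)
avg_rating = rng.normal(75.0, 6.0, 3000)
win = rng.random(3000) < 1.0 / (1.0 + np.exp(-(avg_rating - 75.0) / 10.0))

df = pd.DataFrame({"avg_rating": avg_rating, "home_win": win})
df["bin"] = pd.cut(df["avg_rating"], bins=range(55, 96, 5))
win_rate = df.groupby("bin", observed=True)["home_win"].mean()
# win_rate climbs with rating on average, but each bin is noisy.
```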

This is certainly interesting. We see how higher ratings affect win percentages and the noise that makes it so hard to predict. This looks like a problem for a classifier.

Now we'll define the functions to evaluate models and select the best one.
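The evaluation loop is a cross-validated bake-off: fit each candidate classifier on the rating features and compare mean accuracy. A sketch on synthetic stand-in data (the real features are the two average ratings from the merged frame; the model list here is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: home/away average ratings, home win as the target,
# with the home side winning more often when it out-rates the opponent.
rng = np.random.default_rng(1)
n = 2000
home_rating = rng.normal(75.0, 5.0, n)
away_rating = rng.normal(75.0, 5.0, n)
X = np.column_stack([home_rating, away_rating])
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(home_rating - away_rating) / 5.0))

# Score each candidate with 5-fold cross-validated accuracy.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
```

The best model is then simply `max(results, key=results.get)`.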

And test!

Well, we aren't too far off! But... all the models are slightly in the red. The best appears to be LogisticRegression.

What if we just use the difference as a predictor?
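Mechanically, this just collapses the two rating columns into one feature. A sketch with the same kind of synthetic stand-in data as above (it shows the setup, not the result):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Single synthetic feature: home-minus-away average rating difference.
rng = np.random.default_rng(3)
n = 2000
diff = rng.normal(0.0, 7.0, n)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-diff / 5.0))

# sklearn expects 2-D X, hence the reshape to a single column.
acc = cross_val_score(LogisticRegression(), diff.reshape(-1, 1), y, cv=5).mean()
```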

It got worse -_- This perhaps suggests that my hypothesis is wrong. Perhaps teams' ratings contribute independently? When you think about it, that makes sense. A team's quality is NOT dependent on the opponent's quality; its performance on a given day may be, but that performance is just as dependent on the team itself. Although we weren't able to outpredict Vegas, we got pretty close. Modern-day oddsmakers likely use machine learning to keep their numbers profitable. Even though it's impossible to be perfect, being as close to perfect in prediction as possible gives a company the biggest advantage, because it leaves little value for customers to exploit. But that's what makes the game fun, and addictive. No matter the power of Goliath, there is always that small belief in each person who considers themself David, armed with historical datasets and machine learning models.